Random Constraint Satisfaction Problems 



Amin Coja-Oghlan* 

University of Edinburgh, School of Informatics, Edinburgh EH8 9AB, UK 
acoghlan@inf.ed.ac.uk

Random instances of constraint satisfaction problems such as k-SAT provide challenging benchmarks.
If there are m constraints over n variables, there is typically a large range of densities r = m/n
where non-constructive arguments show that solutions exist with probability close to one. However,
no algorithm is known to find solutions efficiently with non-vanishing probability even at much
lower densities. This fact appears to be related to a phase transition in the set of all solutions.
The goal of this extended abstract is to provide a perspective on this phenomenon, and on the
computational challenge that it poses.

1 A computational challenge 

Numerous constraint satisfaction problems ("CSPs") are well known to be NP-hard. Examples of such
problems are, for any k ≥ 3:

k-SAT. The input is a propositional formula in conjunctive normal form Φ = Φ_1 ∧ ... ∧ Φ_m, where each
clause Φ_i is a disjunction of k literals over a set of Boolean variables {x_1, ..., x_n}. The goal is to
decide whether there is an assignment of x_1, ..., x_n such that the expression Φ evaluates to true,
and if so, to find such a satisfying assignment.

k-NAE. The input is a propositional formula as in k-SAT. The objective is to decide whether there is
an assignment of x_1, ..., x_n such that each clause contains both a true and a false literal ("Not All
Equal").

k-coloring. Given a (simple, undirected) graph G = (V,E), decide whether there is a k-coloring, i.e., an
assignment V → {1, ..., k} of "colors" to the vertices such that for any edge e = {v,w} ∈ E the
vertices v, w are assigned different colors.

We will call the desired type of assignment in each case a solution. 
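
As a concrete reading of the three definitions above, the following Python sketch checks whether a given assignment is a solution. The clause encoding (signed integers, +i for x_i and -i for its negation) and all function names are assumptions made for illustration; they do not come from the text.

```python
from typing import Dict, Sequence, Tuple

# Assumed encoding: a clause is a tuple of non-zero integers,
# +i standing for the literal x_i and -i for its negation.
Clause = Tuple[int, ...]


def literal_value(lit: int, assignment: Dict[int, bool]) -> bool:
    """Truth value of a literal under a total assignment."""
    value = assignment[abs(lit)]
    return value if lit > 0 else not value


def is_sat_solution(clauses: Sequence[Clause], assignment: Dict[int, bool]) -> bool:
    """k-SAT: every clause contains at least one true literal."""
    return all(any(literal_value(l, assignment) for l in c) for c in clauses)


def is_nae_solution(clauses: Sequence[Clause], assignment: Dict[int, bool]) -> bool:
    """k-NAE: every clause contains both a true and a false literal."""
    return all(
        {literal_value(l, assignment) for l in c} == {True, False} for c in clauses
    )


def is_coloring(edges: Sequence[Tuple[int, int]], coloring: Dict[int, int]) -> bool:
    """k-coloring: the endpoints of every edge receive different colors."""
    return all(coloring[v] != coloring[w] for v, w in edges)
```

Note that every NAE solution is also a SAT solution of the same formula, but not conversely.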

Since the above problems are NP-hard, no efficient algorithm is known to solve all possible problem 
instances. However, the theory of NP-hardness does not provide us with an explicit class of "hard" 
problem instances. It just shows that all NP-complete problems are "equally hard". Furthermore, NP- 
hardness merely suggests that there exist hard problem instances. This does not rule out efficient heuristic 
algorithms that succeed on "most" inputs. 

Yet it is surprisingly simple to generate problem instances that seem to elude all known heuristics. In 
fact, in all of the above problems, randomly generated instances provide extremely challenging inputs. 
Throughout we denote the number of variables by n and the number of constraints by m. In the case of
k-SAT and k-NAE "variable" has the obvious meaning. In k-coloring "variables" refers to the vertices
of the graph. Moreover, the constraints are the edges (in k-coloring) or the clauses (in k-SAT and
k-NAE). We consider random problem instances generated by just picking a set of m constraints uniformly
at random. In k-SAT or k-NAE this yields a random propositional formula F_k(n,m), and in k-coloring a
random graph G(n,m). Generally we will be interested in "large" instances, i.e., n → ∞. Furthermore,
mostly the constraint density r = m/n will remain bounded as n gets large. We say that the random
instance has a property with high probability ("w.h.p.") if the probability that the property holds
tends to one as n → ∞.
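
A minimal sketch of how such instances might be generated in Python follows. It rests on assumptions that the text leaves open: each clause uses k distinct variables with independent random signs, clauses are drawn independently (so repeats are possible, unlike a true set of m clauses), and the graph consists of m distinct edges. The function names and the signed-integer clause encoding are illustrative only.

```python
import random
from typing import List, Tuple


def random_k_cnf(n: int, m: int, k: int, rng: random.Random) -> List[Tuple[int, ...]]:
    """Draw m clauses over variables 1..n, each on k distinct variables.

    Assumption: clauses are drawn independently, so a clause may repeat.
    A literal +i stands for x_i and -i for its negation.
    """
    clauses = []
    for _ in range(m):
        variables = rng.sample(range(1, n + 1), k)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses


def random_graph(n: int, m: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Draw m distinct edges on vertices 0..n-1 by rejection sampling."""
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a simple graph")
    edges = set()
    while len(edges) < m:
        v, w = rng.sample(range(n), 2)  # two distinct vertices
        edges.add((min(v, w), max(v, w)))
    return sorted(edges)


if __name__ == "__main__":
    rng = random.Random(1)
    n, r = 1000, 4.0  # constraint density r = m/n
    formula = random_k_cnf(n, int(r * n), 3, rng)
    print(len(formula), "clauses, first:", formula[0])
```

Rejection sampling is adequate here because the regime of interest is sparse (m proportional to n), so collisions are rare.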

For a large range of densities r non-constructive arguments show that the random instance F_k(n,m)
or G(n,m) has a solution w.h.p., but no efficient algorithm is known to find one with a non-vanishing 
probability. Thus, random constraint satisfaction problems pose an algorithmic challenge. They have 
withstood more than 20 years of extensive research efforts. The aim of this short paper is to provide a 
perspective on this instructive class of instances. 



2 For what densities do solutions exist? 

In each of k-SAT, k-NAE, and k-coloring there is a sharp threshold r_k such that for densities r < r_k - ε
there is a solution w.h.p., while for densities r > r_k + ε no solution exists w.h.p., for any fixed ε > 0
[2, 16]. Actually the threshold r_k is non-uniform, i.e., r_k = r_k(n) is not a fixed number but may depend
on n. However, r_k(n) is conjectured to converge.

The threshold r_k = r_k(n) is not known precisely for any k ≥ 3 (and large n), but asymptotically tight
bounds in the large k limit are. An upper bound on r_k can be obtained by computing the expected number
of solutions. If r is such that the expected number of solutions is o(1) as n → ∞, then r_k ≤ r by Markov's
inequality. Proofs of this kind are called first moment arguments. They show that r_k ≤ 2^k ln 2 in
k-SAT, r_k ≤ 2^(k-1) ln 2 in k-NAE, and r_k ≤ k ln k in k-coloring.
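
To make the first moment calculation concrete, the sketch below evaluates the density at which the exponential growth rate of the expected number of solutions vanishes. It assumes a simplified model that is not spelled out in the text: a fixed assignment out of q^n candidates violates each random constraint independently with probability p, so that the expectation behaves like q^n (1 - p)^m, with p = 2^(-k) for k-SAT, p = 2^(1-k) for k-NAE, and p = 1/k for a balanced k-coloring (lower-order corrections ignored). The function names are hypothetical.

```python
import math


def first_moment_bound(num_values: int, violation_prob: float) -> float:
    """Density r solving ln(q) + r * ln(1 - p) = 0.

    Above this density the expected number of solutions q^n (1-p)^(rn)
    tends to zero, so Markov's inequality rules out solutions w.h.p.
    """
    return math.log(num_values) / -math.log1p(-violation_prob)


def ksat_bound(k: int) -> float:
    return first_moment_bound(2, 2.0 ** -k)


def knae_bound(k: int) -> float:
    return first_moment_bound(2, 2.0 ** (1 - k))


def kcol_bound(k: int) -> float:
    return first_moment_bound(k, 1.0 / k)


if __name__ == "__main__":
    for k in (3, 5, 10):
        print(k, ksat_bound(k), 2 ** k * math.log(2))
        print(k, knae_bound(k), 2 ** (k - 1) * math.log(2))
        print(k, kcol_bound(k), k * math.log(k))
```

Since -ln(1 - p) > p, each exact value lies slightly below the corresponding asymptotic expression 2^k ln 2, 2^(k-1) ln 2, or k ln k.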

For general values of k the best current lower bounds on r_k are obtained via the second moment
method. The simple idea is to bound the expectation of the squared number of solutions for a given
density r. To be precise, let X = X(n,m) denote the (random) number of solutions. If E(X) ≫ 1 and
E(X^2) = O(E(X)^2), then the Paley-Zygmund inequality

\[ \mathrm{P}[X > 0] \ge \frac{\mathrm{E}(X)^2}{\mathrm{E}(X^2)} \]

entails that the probability that there is a solution remains bounded away from zero for arbitrarily large
n. Hence, the sharp threshold result implies that r_k ≥ r = m/n.
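
For completeness, this special case of the Paley-Zygmund inequality follows in one step from the Cauchy-Schwarz inequality. The derivation below is the standard argument, written out under the assumption that X is a non-negative random variable with finite second moment.

```latex
\begin{align*}
\mathrm{E}(X) &= \mathrm{E}\bigl(X \cdot \mathbf{1}\{X > 0\}\bigr)
  \le \sqrt{\mathrm{E}(X^2)\,\mathrm{P}[X > 0]}
  && \text{(Cauchy--Schwarz)} \\
\Longrightarrow\quad \mathrm{P}[X > 0] &\ge \frac{\mathrm{E}(X)^2}{\mathrm{E}(X^2)} .
\end{align*}
```

In particular, if E(X^2) ≤ C · E(X)^2 for a constant C, then a solution exists with probability at least 1/C.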

The second moment method can be applied quite directly to both k-NAE and k-coloring. In
the latter case computing the second moment amounts to a challenging optimization problem over the
Birkhoff polytope. In k-SAT, by contrast, it is necessary to assign certain weights to the solutions.
The result is that r_k ≥ 2^k ln 2 - O(k) in k-SAT, r_k ≥ 2^(k-1) ln 2 - O(1) in k-NAE, and r_k ≥ k ln k - O(ln k) in
k-coloring. Thus, asymptotically for large k these bounds match the first moment upper bounds, up to
second order terms.



3 Finding solutions efficiently 

In all three problems the density up to which solutions are known to exist significantly exceeds the
density up to which efficient algorithms are known to find any. In k-SAT (resp. k-NAE) no algorithm is
known to find solutions beyond r = 2^k ln(k)/k (resp. r = 2^(k-1) ln(k)/k) for large k. This is a factor of
k/ln(k) below the threshold density. Moreover, no algorithm is known to find k-colorings of G(n,m) for
r = m/n > (1/2) k ln k for large k (a factor of 2 below the threshold). This is in spite of intensive
research on the subject. In the sequel we follow the discussion in [1, Section 1.1].

To describe the class of algorithms that have been suggested and/or analyzed we need the notion of a
factor graph. This is a bipartite graph associated with a problem instance. Its vertices are the constraints
and the variables. Each variable is connected with all the constraints that it occurs in. In the case of
random CSPs with r bounded as n → ∞ the factor graph has girth Ω(ln n) w.h.p. (after the removal of
O(1) constraints). Hence, for each node v the subgraph N_ω(v) spanned by the vertices w that are at
distance at most ω from v is a tree w.h.p., provided that ω ≪ ln n.
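
The factor graph and its depth-ω neighborhoods can be sketched in Python as follows. The node labels ("var", i) and ("con", j), the signed-integer clause encoding, and the function names are assumptions chosen for illustration. The tree test uses the fact that a connected graph is a tree exactly when it has one edge fewer than it has vertices.

```python
from collections import deque
from typing import Dict, Hashable, Sequence, Set, Tuple

Node = Tuple[str, int]  # ("var", i) or ("con", j); labels are illustrative


def factor_graph(clauses: Sequence[Tuple[int, ...]]) -> Dict[Node, Set[Node]]:
    """Bipartite adjacency map: each variable is joined to the clauses it occurs in."""
    adj: Dict[Node, Set[Node]] = {}
    for j, clause in enumerate(clauses):
        con = ("con", j)
        adj.setdefault(con, set())
        for lit in clause:
            var = ("var", abs(lit))
            adj.setdefault(var, set()).add(con)
            adj[con].add(var)
    return adj


def neighborhood(adj: Dict[Node, Set[Node]], root: Node, omega: int) -> Set[Node]:
    """All nodes at distance at most omega from root (breadth-first search)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == omega:
            continue  # do not expand beyond the radius
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def is_tree(adj: Dict[Node, Set[Node]], nodes: Set[Node]) -> bool:
    """The induced subgraph on a BFS ball is connected; it is a tree iff |E| = |V| - 1."""
    endpoints = sum(1 for u in nodes for w in adj[u] if w in nodes)
    return endpoints // 2 == len(nodes) - 1
```

For example, two clauses that share two variables already create a cycle of length four through those variables.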

Most algorithms that have been suggested are local. That is, the value that the algorithm assigns to
a variable x only depends on the constraints and variables in an ω-neighborhood N_ω(x) for some fixed
number ω (i.e., independent of n). In fact, mostly ω = 1 or ω = 2. Furthermore, most algorithms do not
backtrack. That is, once a variable has been assigned, its value will never change.

The UnitClause algorithm for k-SAT is a prototypical example. Initially the algorithm considers
all variables unassigned. In each step it selects a variable and assigns it for good. In step t the algorithm
checks if there is a unit clause, i.e., a clause in which precisely k - 1 literals are false due to previous
assignments while the remaining literal is still unassigned. If so, it selects an unassigned variable x_t
that occurs in a unit clause and sets x_t so as to satisfy the unit clause. If not, the algorithm selects a
variable x_t randomly and assigns it a random value. In the limit of large k this simple linear time
algorithm finds a satisfying assignment with a non-vanishing probability for densities up to
r ∼ (e/2) · 2^k/k [12]. The best current algorithm for random k-SAT is local as well (with ω = 3) and
succeeds up to (1 - ε_k) 2^k ln(k)/k, where ε_k → 0 [13].
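
The following Python sketch follows the description of UnitClause given above. It is deliberately naive: it rescans all clauses in every step, so it runs in time O(nm) rather than linear time, and it simply gives up (returns None) as soon as some clause has all its literals false. The signed-integer clause encoding and the function name are assumptions for illustration.

```python
import random
from typing import Dict, Optional, Sequence, Tuple


def unit_clause(
    clauses: Sequence[Tuple[int, ...]], n: int, rng: random.Random
) -> Optional[Dict[int, bool]]:
    """Assign variables 1..n one at a time without backtracking.

    Returns a satisfying assignment, or None if a clause was falsified.
    A literal +i stands for x_i and -i for its negation (assumed encoding).
    """
    assignment: Dict[int, bool] = {}
    while True:
        forced = None
        for clause in clauses:
            # A clause is satisfied once one of its literals is true.
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue
            free = [l for l in clause if abs(l) not in assignment]
            if not free:
                return None  # all k literals are false: the run has failed
            if len(free) == 1 and forced is None:
                forced = free[0]  # unit clause: k - 1 literals are false
        if len(assignment) == n:
            return assignment
        if forced is not None:
            assignment[abs(forced)] = forced > 0
        else:
            unassigned = [v for v in range(1, n + 1) if v not in assignment]
            assignment[rng.choice(unassigned)] = rng.random() < 0.5


if __name__ == "__main__":
    result = unit_clause([(1, 2, 3), (-1, 2, -3), (1, -2, 3)], 3, random.Random(0))
    print(result)
```

Because the procedure never revisits a decision, a single unlucky random choice can make it fail on a satisfiable formula; the rigorous results concern the probability that this does not happen.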

In graph coloring the situation is similar. A very simple greedy algorithm (ω = 2) succeeds up to
density r = (1/2 - ε_k) k ln k for large k. A slightly better local algorithm (ω = 2 as well) actually works up
to r = (1/2) k ln k [4].

Given the simplicity of these algorithms, their (rigorous) analyses can be surprisingly demanding. 
They are mostly based on tracking the execution of the algorithm by either differential equations, Markov 
chains, or martingales. The use of these techniques seems limited to algorithms with small depth, say
ω = 2 or ω = 3. Furthermore, these methods do not seem sufficient for analyzing algorithms that reassign
variables.

A class of local algorithms with larger depth ω has been put forward on the basis of ideas from
the statistical mechanics of disordered systems [11]. Here ω is independent of n but not bounded a
priori. In other words, it has to be chosen sufficiently large in terms of r and k. In each round the
algorithm aims to assign one variable (for good). To this end the algorithm performs for each variable
x a computation that depends on the subgraph N_ω(x) and the values that have been assigned to the
variables in that subgraph previously. For instance, in k-SAT the algorithm considers the sub-formula
of the input k-SAT formula that corresponds to N_ω(x). It computes the probability that in a random
satisfying assignment of that sub-formula, given the values of all previously assigned variables in it, x
takes the value true/false. Then, the algorithm selects the variable for which this computation yields
the largest bias towards either value and assigns the variable accordingly. The computation on the
sub-instance N_ω(x) can be performed efficiently, because N_ω(x) is acyclic w.h.p. In fact, the computation
can be implemented to run simultaneously for all variables by means of a message passing procedure
("Belief Propagation"). The Survey Propagation algorithm is a somewhat more involved variant of this
strategy (see [11] for details).
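
The decimation loop described above can be illustrated on tiny formulas. The sketch below is not Belief Propagation: it replaces the message passing estimate on the neighborhood of x by exact marginals obtained through exhaustive enumeration over the whole formula, which is only feasible for very small n. It is meant to show the "pick the most biased variable and fix it" step only; the function names and the signed-integer clause encoding are assumptions.

```python
from itertools import product
from typing import Dict, Iterator, Optional, Sequence, Tuple


def satisfying_extensions(
    clauses: Sequence[Tuple[int, ...]], n: int, fixed: Dict[int, bool]
) -> Iterator[Dict[int, bool]]:
    """Enumerate all satisfying total assignments that agree with `fixed`."""
    free = [v for v in range(1, n + 1) if v not in fixed]
    for values in product([False, True], repeat=len(free)):
        a = dict(fixed)
        a.update(zip(free, values))
        if all(any(a[abs(l)] == (l > 0) for l in c) for c in clauses):
            yield a


def decimate(clauses: Sequence[Tuple[int, ...]], n: int) -> Optional[Dict[int, bool]]:
    """Repeatedly fix the variable whose exact marginal is most biased."""
    fixed: Dict[int, bool] = {}
    while len(fixed) < n:
        sols = list(satisfying_extensions(clauses, n, fixed))
        if not sols:
            return None  # no satisfying assignment extends the current choices
        best = None  # (bias, variable, value)
        for v in range(1, n + 1):
            if v in fixed:
                continue
            p_true = sum(a[v] for a in sols) / len(sols)
            bias = abs(p_true - 0.5)
            if best is None or bias > best[0]:
                best = (bias, v, p_true >= 0.5)
        fixed[best[1]] = best[2]
    return fixed
```

With exact marginals the loop cannot fail on a satisfiable formula, since the chosen value always has positive marginal probability. The algorithms discussed in the text only have approximate marginals at their disposal, which is where the difficulty lies.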

Since this type of algorithm crucially requires large ω (for the estimates of the marginals to be
accurate), its rigorous analysis is beyond current methods. Experimentally algorithms based on this
scheme, namely, Belief/Survey Propagation guided decimation, outperform any other known ones by far
for small k. However, for large k experimental evidence is difficult to come by. For instance, in k-SAT
the relevant density scales exponentially in k.



4 An algorithmic barrier?

On the basis of non-rigorous but sophisticated techniques from statistical mechanics a hypothesis has 
been put forward that might explain the demise of local algorithms way below the threshold for the 
existence of solutions [17]. This hypothesis concerns the solution space. For a CSP instance Φ we
let S(Φ) signify the set of all solutions of Φ. For instance, if Φ is a k-SAT formula with n variables,
then S(Φ) ⊆ {0,1}^n is the set of all satisfying assignments. Similarly, if Φ is a graph on n vertices,
then S(Φ) ⊆ {1, ..., k}^n is the set of all k-colorings. We turn the set S(Φ) into a graph by considering
σ, τ ∈ S(Φ) adjacent if their Hamming distance equals one.
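
For very small formulas the solution graph can be built explicitly. The sketch below enumerates all satisfying assignments by brute force and groups them into connected components, where two solutions are adjacent if they differ in exactly one coordinate. The encoding (tuples of 0/1 values, signed-integer literals) and the function names are assumptions for illustration; the approach takes time exponential in n.

```python
from itertools import product
from typing import List, Sequence, Tuple

Solution = Tuple[int, ...]  # entry i-1 holds the 0/1 value of x_i


def solutions(clauses: Sequence[Tuple[int, ...]], n: int) -> List[Solution]:
    """All satisfying assignments, by exhaustive enumeration of {0,1}^n."""
    return [
        bits
        for bits in product((0, 1), repeat=n)
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
    ]


def components(sols: Sequence[Solution]) -> List[List[Solution]]:
    """Connected components of the graph with edges at Hamming distance one."""
    sol_set = set(sols)
    seen = set()
    comps = []
    for start in sols:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            cur = stack.pop()
            comp.append(cur)
            for i in range(len(cur)):
                nb = cur[:i] + (1 - cur[i],) + cur[i + 1:]  # flip coordinate i
                if nb in sol_set and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        comps.append(comp)
    return comps
```

For instance, the two clauses (x_1 or not x_2) and (not x_1 or x_2) force x_1 = x_2, so the two solutions 00 and 11 lie in different components.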

For random CSP instances with densities below the threshold the size of S(Φ) is exponential in n
w.h.p. More precisely, in both k-coloring and k-NAE we have |S(Φ)| = E(|S(Φ)|) · exp(o(n)) w.h.p., and
the first moment E(|S(Φ)|) is easily computed [1]. By contrast, in k-SAT we have |S(Φ)| ≤ E(|S(Φ)|) ·
exp(-Ω(n)) w.h.p., but |S(Φ)| ≥ E(|S(Φ)|) · exp(-ζ_k n) w.h.p., where ζ_k → 0 exponentially for large k.

The dynamic replica symmetry breaking ("dRSB") hypothesis states that in each of k-SAT, k-NAE,
and k-coloring there is a density r_dRSB < r_k below the threshold for the existence of solutions where
the shape of the set S(Φ) undergoes a phase transition. Furthermore, r_dRSB coincides asymptotically
with the density up to which local algorithms are known to find solutions. That is, in the large k limit
r_dRSB ∼ 2^k ln(k)/k in k-SAT, r_dRSB ∼ 2^(k-1) ln(k)/k in k-NAE, and r_dRSB ∼ (1/2) k ln k in k-coloring.
According to the dRSB hypothesis, for densities r < r_dRSB the graph S(Φ) is essentially connected w.h.p. More
precisely, there is a single component that contains a 1 - o(1) fraction of all solutions. By contrast,
for densities r > r_dRSB there are exponentially many components, none of which contains more than an
exponentially small fraction of all solutions. This means that for r < r_dRSB the correlations among the
variables that shape the set S(Φ) are purely local, whereas for r > r_dRSB long range correlations arise.

Confirming and elaborating on this hypothesis, we recently established a good bit of the dRSB
phenomenon rigorously [1]. We proved that beyond the conjectured densities r_dRSB the set S(Φ) decomposes
into exponentially small well-separated components in k-SAT, k-NAE, and k-coloring. Furthermore,
each component is very rigid locally. To be precise, we say that a variable x is frozen in a solution
σ if any solution τ such that τ(x) ≠ σ(x) has at least a linear Hamming distance Ω(n) from σ. In
other words, changing the value of x necessitates changing the values of Ω(n) other variables. Then for
r > (1 + ε_k) r_dRSB in all but an o(1)-fraction of all solutions all but an ε_k-fraction of the variables
are frozen w.h.p., where ε_k → 0 for large k.
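
The quantity behind the notion of a frozen variable can be computed by brute force on small examples. For a solution σ and a variable x, the sketch below returns the smallest Hamming distance from σ to any solution that gives x the opposite value. "Frozen" in the sense of the text means that this distance grows linearly in n, which is an asymptotic statement and cannot be read off a single small instance. Names and encoding are assumptions for illustration.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Solution = Tuple[int, ...]  # entry i holds the 0/1 value of x_{i+1}


def solutions(clauses: Sequence[Tuple[int, ...]], n: int) -> List[Solution]:
    """All satisfying assignments, by exhaustive enumeration of {0,1}^n."""
    return [
        bits
        for bits in product((0, 1), repeat=n)
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
    ]


def flip_cost(sols: Sequence[Solution], sigma: Solution, i: int) -> Optional[int]:
    """Least Hamming distance from sigma to a solution that differs at index i.

    Returns None if no solution assigns the opposite value at index i.
    The distance counts the flipped coordinate itself.
    """
    dists = [
        sum(a != b for a, b in zip(sigma, tau))
        for tau in sols
        if tau[i] != sigma[i]
    ]
    return min(dists) if dists else None


if __name__ == "__main__":
    # The clauses force x_1 = x_2 = x_3, so flipping any variable costs 3.
    chain = [(1, -2), (-1, 2), (2, -3), (-2, 3)]
    sols = solutions(chain, 3)
    print(sols, flip_cost(sols, (0, 0, 0), 0))
```

In the equality chain of the usage example every variable can only change together with all others, which is the small-scale analogue of the rigidity described above.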

This suggests that on random instances with density r > (1 + ε_k) r_dRSB local algorithms are unlikely
to succeed. After all, a local algorithm assigns a variable x only on the basis of the values of variables
that have distance at most ω from x in the factor graph, where ω = O(1) is bounded as n → ∞. But
the presence of frozen variables yields mutual constraints on the values that can be assigned to variables
at distance Ω(ln n) from x in the factor graph. Local algorithms do not seem capable of catching these
long-range effects.

The above discussion applies to "large" values of k (say, k > 10). Non-rigorous arguments as well 
as experimental evidence [10] suggest that the picture is quite different and rather more complicated for
"small" k. This may be the reason why local algorithms such as Survey Propagation guided decimation
fare extremely well for, e.g., random k-SAT with k = 3, 4, 5. Whether or not algorithms of this type
succeed beyond (1 + ε) r_dRSB for large k and any fixed ε > 0 remains an important open problem. A
plausible scenario may be that such algorithms succeed up to r = (1 + ε_k) r_dRSB for some ε_k → 0.



5 Conclusion

Random instances of constraint satisfaction problems exhibit a phase transition with respect to the
existence of solutions. In addition, there is strong evidence that at a much lower constraint density a
further transition takes place that affects the performance of local algorithms. In statistical mechanics
terms, this is known as dynamic replica symmetry breaking. Roughly speaking, while below the density r_dRSB
conceptually fairly simple algorithms find solutions efficiently, no efficient algorithm is known to find
any beyond that density (for general values of k). This appears to be due to a transition in the geometry of
the set of all solutions, which shatters into exponentially small components and exhibits frozen variables
beyond r_dRSB. Coping with problem instances of this type poses an algorithmic challenge.

Virtually all algorithms that have been suggested/analyzed for sparse random CSPs are (essentially) 
local. It seems plausible that such algorithms have a hard time catching the long-range correlations that 
occur beyond r_dRSB. However, proving this in any generality is an open problem.

Global algorithmic techniques such as spectral methods or semidefinite programming apply to classes
of randomly generated CSP instances that have essentially a single solution and a sufficiently high
constraint density (way beyond the threshold for the existence of solutions in the models discussed
here) [15]. One way of generating such instances is by "planting" a solution in an otherwise random
instance. The success of spectral methods implies the success of Belief Propagation in, for instance,
random 3-coloring [14]. But global methods do not seem to apply to instances of relatively low density
(i.e., below the threshold r_k).

Thus, no efficient algorithms are known to solve random CSP instances with density r_dRSB < r < r_k
for general k. Moreover, this seems to be a fairly universal fact, independent of the precise CSP under 
consideration. It might be interesting to investigate how alternative models of computation or alternative 
algorithmic paradigms fare on such inputs. 